Martin Schweinberger

install.packages("stringr")
install.packages("dplyr")
install.packages("flextable")
install.packages("checkdown")


This tutorial introduces regular expressions (regex) and demonstrates how to use them when working with language data in R. A regular expression is a special sequence of characters that describes a search pattern. You can think of regular expressions as precision search tools — far more powerful than simple find-and-replace — that let you locate, extract, validate, and transform text based on its structure rather than its exact content.
Regular expressions have wide applications across linguistics and computational humanities: searching corpora for inflected forms, extracting named entities, cleaning OCR output, tokenising text, validating annotation schemes, and building text-processing pipelines. Once mastered, they become one of the most versatile tools in any language researcher’s toolkit.
Before working through this tutorial, please complete or familiarise yourself with:
- Quantifiers (.*, +, ?, {n,m}) and alternation (|)
- Shorthand classes \w, \d, \s, and their negations
- stringr functions — str_detect(), str_extract(), str_replace(), and more
- dplyr pipelines — filtering and mutating with patterns

For further study, the following resources are highly recommended:
Install required packages (once only):
Load packages:
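To load the four packages installed above:

```r
library(stringr)    # regex-powered string functions
library(dplyr)      # data-frame pipelines
library(flextable)  # formatted display tables
library(checkdown)  # interactive quiz questions
```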
We will work with two types of objects throughout: a short example sentence for demonstrating individual patterns, and a longer example text representing realistic corpus data.
# Short example sentence for basic demonstrations
sent <- "The cat sat on the mat."
# A longer example text: an excerpt about linguistics
et <- paste(
"Grammar is the system of a language. People sometimes describe grammar as",
"the rules of a language, but in fact no language has rules. If we use the",
"word rules, we suggest that somebody created the rules first and then spoke",
"the language, like the rules of a game. But languages did not start like",
"that. Languages started when humans started to communicate with each other.",
"Grammars developed naturally. After some time, people described the grammar",
"of their languages. Languages change over time. Grammar changes too.",
"Children learn the grammar of their first language naturally. They do not",
"need to study it. Native speakers know intuitively whether a sentence is",
"grammatically correct or not. Non-native speakers often learn grammar rules",
"formally, through instruction. Prescriptive grammar describes how people",
"should speak, while descriptive grammar describes how people actually speak.",
"Linguists study grammars to understand language structure and acquisition.",
"The field of syntax deals with sentence structure, while morphology examines",
"how words are formed. Phonology studies sound systems in human languages.",
"Pragmatics investigates how context influences the interpretation of meaning.",
"Computational linguistics applies formal grammar to natural language processing.",
"Regular expressions are useful tools for searching and extracting patterns.",
"They can match words like 'cat', 'bat', or 'hat' with a single pattern."
)
# Split into individual tokens (words and punctuation)
tokens <- str_split(et, "\\s+") |> unlist()

What you’ll learn: The building blocks of regular expressions — how each type of pattern works and what it matches
Key concept: Regular expressions describe structure, not content. [aeiou]{2,} matches any sequence of two or more vowels, regardless of which vowels or in which word.
The simplest regular expression is a literal character — it matches exactly that character. A sequence of literal characters matches that exact sequence:
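With the short sentence defined above, literal patterns behave along these lines (a sketch; the exact calls behind the outputs below are not preserved):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_detect(sent, "cat")  # TRUE: the literal sequence "cat" occurs
str_detect(sent, "at")   # TRUE: matches inside "cat", "sat", and "mat"
str_detect(sent, "dog")  # FALSE: no such sequence in the sentence
```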
[1] TRUE
[1] TRUE
[1] TRUE
[1] TRUE
To match a literal dot (rather than “any character”), escape it with a double backslash:
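A sketch of the contrast (the original calls are not preserved):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_detect(sent, "\\.")   # TRUE: the sentence contains a literal dot
str_detect(sent, ".")     # TRUE: unescaped, the dot matches any character
str_detect("mat", "\\.")  # FALSE: "mat" contains no dot
```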
[1] TRUE
[1] TRUE
[1] FALSE
In most programming languages, a single backslash \ is the regex escape character. In R strings, \ itself must be escaped, so regex escapes require double backslash \\. For example:
- \\. in R code → \. as a regex → matches a literal dot
- \\b in R code → \b as a regex → matches a word boundary
- \\d in R code → \d as a regex → matches a digit

This double-backslash requirement catches many beginners. Remember: every \ you intend for regex needs to be written as \\ in R.
Anchors match positions in the string, not characters. They constrain where in the string a pattern can match.
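With the example sentence, the main anchors can be sketched as follows (the original calls are not preserved):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_detect(sent, "^The")       # TRUE: the string starts with "The"
str_detect(sent, "^cat")       # FALSE: "cat" is not at the start
str_detect(sent, "mat\\.$")    # TRUE: the string ends with "mat."
str_detect(sent, "\\bcat\\b")  # TRUE: "cat" occurs as a whole word
str_detect(sent, "\\bat\\b")   # FALSE: "at" occurs only inside words
```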
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
[1] FALSE
[1] TRUE
[1] FALSE
[1] TRUE
\b is indispensable for corpus searches. Without it, searching for “the” would match “the” inside “other”, “there”, “weather”, and so on. Always use \\bword\\b when you want whole-word matches.
A character class [...] matches any single character from the set listed inside the brackets:
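Calls along these lines produce the outputs below (a sketch; the exact input strings are not all preserved):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_extract_all(sent, "[csm]at")[[1]]           # "cat" "sat" "mat"
str_extract_all("Hello World", "[A-Z]")[[1]]    # "H" "W"
str_extract_all("abc123", "[0-9]")[[1]]         # "1" "2" "3"
str_extract_all("Hello World", "[^aeiou ]")[[1]] # everything except vowels and spaces
```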
[[1]]
[1] "cat" "sat" "mat"
[[1]]
[1] "T" "h" "c" "t" "s" "t" "n" "t" "h" "m" "t" "."
[[1]]
[1] "e" "l" "l" "o" "o" "r" "l" "d"
[[1]]
[1] "H" "W"
[[1]]
[1] "1" "2" "3"
[[1]]
[1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"
R supports POSIX character classes — named sets written inside [:..:] inside an outer [...]:
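A sketch with a mixed test string (the original input string is not preserved):

```r
library(stringr)

x <- "Hello, World!\t123."

str_extract_all(x, "[[:alpha:]]")[[1]]  # letters only
str_extract_all(x, "[[:digit:]]")[[1]]  # "1" "2" "3"
str_extract_all(x, "[[:punct:]]")[[1]]  # "," "!" "."
str_extract_all(x, "[[:space:]]")[[1]]  # the space and the tab
```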
[[1]]
[1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d"
[[1]]
[1] "1" "2" "3"
[[1]]
[1] "," "!" "."
[[1]]
[1] "H" "e" "l" "l" "o" "W" "o" "r" "l" "d" "1" "2" "3"
[[1]]
[1] "\t" " " " "
The full set of POSIX classes available in R:
Class | Matches |
|---|---|
[:alpha:] | Any letter (a–z, A–Z) |
[:lower:] | Lowercase letters (a–z) |
[:upper:] | Uppercase letters (A–Z) |
[:digit:] | Digits (0–9) |
[:alnum:] | Letters and digits |
[:punct:] | Punctuation: . , ; : ! ? " ' ( ) [ ] { } / \ @ # $ % ^ & * - _ + = ~ ` | |
[:space:] | All whitespace: space, tab, newline, return, form-feed |
[:blank:] | Space and tab only |
[:graph:] | All visible characters (alnum + punct) |
[:print:] | Printable characters (graph + space) |
Quantifiers specify how many times the preceding element should match:
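A sketch with a short test string (the original string is not preserved):

```r
library(stringr)

x <- "aabbbcc"

str_extract_all(x, "b+")[[1]]    # "bbb": one or more b's, matched greedily
str_extract_all(x, "b{2}")[[1]]  # "bb": exactly two b's
str_extract_all(x, "ab?")[[1]]   # "a" "ab": the b is optional
str_extract_all(x, "ab*c")[[1]]  # "abbbc": zero or more b's between a and c
```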
[[1]]
[1] "" "" "bbb" "" "" "" "" ""
[[1]]
[1] "bbb"
[1] TRUE TRUE
[[1]]
[1] "communicat" "intuitivel" "grammatica" "instructio" "rescriptiv"
[6] "descriptiv" "understand" "acquisitio" "morphology" "investigat"
[11] "influences" "interpreta" "omputation" "linguistic" "processing"
[16] "expression" "extracting"
[list output truncated: one element per token, almost all character(0); the non-empty matches include "sometimes", "describe", "language", "somebody", "languages", "communicate", "Grammars", "developed", "described", and "Children"]
[ reached getOption("max.print") -- omitted 110 entries ]
[list output truncated: one element per token, mostly character(0); the non-empty matches include "system", "People", "rules", "fact", "word", "first", "spoke", "humans", "After", and "learn"]
[ reached getOption("max.print") -- omitted 110 entries ]
By default, quantifiers are greedy — they match as much as possible. Adding ? after a quantifier makes it lazy — it matches as little as possible:
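The outputs below come from a short tagged string; calls along these lines reproduce them:

```r
library(stringr)

x <- "<b>bold</b> and <i>italic</i>"

str_extract(x, "<.+>")            # greedy: runs to the last ">" in the string
str_extract(x, "<.+?>")           # lazy: "<b>", the shortest possible match
str_extract_all(x, "<.+?>")[[1]]  # "<b>" "</b>" "<i>" "</i>"
```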
[1] "<b>bold</b> and <i>italic</i>"
[1] "<b>"
[[1]]
[1] "<b>" "</b>" "<i>" "</i>"
Parentheses () create a capturing group — a sub-pattern whose match can be referenced or extracted separately. The alternation operator | means OR within a group or pattern.
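Calls along these lines reproduce the outputs below:

```r
library(stringr)

words <- c("colour", "color", "colr")

str_detect(words, "colou?r")      # TRUE TRUE FALSE: the u is optional
str_extract(words, "col(ou|o)r")  # "colour" "color" NA: alternation in a group
```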
[1] TRUE TRUE FALSE
[[1]]
[1] "colour"
[[2]]
[1] "color"
[[1]]
character(0)
[1] TRUE
Use (?:...) when you need to group for alternation or quantification but do not need to capture the match:
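A minimal sketch of the difference (the original example is not preserved):

```r
library(stringr)

# (?:...) groups for alternation without creating a numbered capture
str_extract("walking", "walk(?:s|ed|ing)?")  # "walking"

# with a capturing group, str_match() additionally returns the group's content
str_match("walking", "walk(s|ed|ing)?")      # full match "walking", group "ing"
```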
R supports shorthand escape sequences for common character classes:
Sequence (in R code) | Matches | Example (R string) |
|---|---|---|
\\w | Word characters: [[:alnum:]_] | "\\w+" |
\\W | Non-word characters: [^[:alnum:]_] | "\\W+" |
\\d | Digits: [[:digit:]] | "\\d+" |
\\D | Non-digits: [^[:digit:]] | "\\D+" |
\\s | Whitespace: [[:space:]] | "\\s+" |
\\S | Non-whitespace: [^[:space:]] | "\\S+" |
\\b | Word boundary (position) | "\\bcat\\b" |
\\B | Non-word boundary (position) | "\\Bcat\\B" |
[[1]]
[1] "price" "4" "99"
[[1]]
[1] "07" "3365" "1234" "07" "3346" "5678"
[1] "word1" "word2" "word3" "word4"
[[1]]
[1] "grammar"
Lookaround assertions match a position based on what comes before or after it, without including that context in the match. They are essential for extracting values that are preceded or followed by specific markers.
Syntax | Name | Matches |
|---|---|---|
(?=...) | Positive lookahead | Position followed by ... |
(?!...) | Negative lookahead | Position NOT followed by ... |
(?<=...) | Positive lookbehind | Position preceded by ... |
(?<!...) | Negative lookbehind | Position NOT preceded by ... |
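The price outputs below can be reproduced with calls along these lines:

```r
library(stringr)

prices <- c("$12.99", "$4.50", "7.00", "8.95")

str_extract(prices, "\\d+(?=\\.)")          # the integer part before the dot
str_extract(prices, "(?<=\\$)\\d+\\.\\d+")  # the full price, only when preceded by "$"
```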
[[1]]
[1] "12"
[[2]]
[1] "4"
[[3]]
[1] "7"
[[4]]
[1] "8"
[[1]]
[1] "12.99"
[[2]]
[1] "4.50"
[[3]]
character(0)
[[4]]
character(0)
[[1]]
character(0)
[[2]]
character(0)
[[3]]
[1] "7.00"
[[4]]
[1] "8.95"
A linguistic example — extract words that come before a comma:
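The original input string is not preserved; a hypothetical string that yields the same two matches:

```r
library(stringr)

x <- "Grammar, syntax, and morphology are discussed below."

str_extract_all(x, "\\w+(?=,)")[[1]]  # "Grammar" "syntax"
```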
[[1]]
[1] "Grammar" "syntax"
Q1. What does the regex ^[A-Z] match?
Q2. What is the difference between colou?r and colo[u]?r?
Q3. You want to match words of exactly 5 characters that consist only of lowercase letters. Which pattern is correct?
stringr Functions

What you’ll learn: The stringr functions used most frequently with regular expressions, and when to use each
Key functions: str_detect(), str_count(), str_extract(), str_extract_all(), str_replace(), str_replace_all(), str_remove(), str_remove_all(), str_split(), str_locate()
The stringr package provides a consistent, user-friendly interface to regular expressions in R. All stringr functions follow the same pattern: the string comes first, the pattern second.
str_detect() — Does the Pattern Exist?

Returns TRUE/FALSE for each string in a vector. Most commonly used for filtering:
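A sketch with a hypothetical vector of linguistic terms (the vector behind the outputs below is not preserved):

```r
library(stringr)

terms <- c("syntax", "semantics", "phonetics", "lexicon",
           "discourse", "morphology", "phonology")

str_detect(terms, "ology$")          # which terms end in "-ology"?
terms[str_detect(terms, "ology$")]   # "morphology" "phonology"
```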
[1] TRUE FALSE FALSE FALSE FALSE TRUE TRUE
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE
[1] "syntax" "morphology" "phonology" "pragmatics"
str_count() — How Many Times?

Counts non-overlapping occurrences of a pattern within each string:
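A minimal sketch (the original example is not preserved):

```r
library(stringr)

x <- "the cat and the dog and the bird"

str_count(x, "the")               # 3
str_count(x, "\\band\\b")         # 2
str_count(c("aaa", "abab"), "a")  # 3 2: counted per element of the vector
```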
str_extract() and str_extract_all() — What Matches?

str_extract() returns the first match in each string. str_extract_all() returns all matches as a list:
[1] NA "synt" "rph" NA NA NA "ngr"
[[1]]
[1] "12" "99"
[[2]]
[1] "4" "12"
[[3]]
[1] "2024"
[1] "acquisition" "communicate" "Computational" "described"
[5] "describes" "descriptive" "developed" "expressions"
[9] "extracting" "grammatically" "influences" "instruction"
[13] "interpretation" "intuitively" "investigates" "languages"
[17] "Languages" "linguistics" "Linguists" "morphology"
[21] "naturally" "Phonology" "Pragmatics" "Prescriptive"
[25] "processing" "searching" "sometimes" "structure"
[29] "understand"
str_replace() and str_replace_all()

Replace the first (or all) occurrence(s) of a pattern with a replacement string. Backreferences (\\1, \\2) refer to captured groups in the replacement:
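Calls along these lines reproduce the outputs below:

```r
library(stringr)

sent <- "The cat sat on the mat."

str_replace(sent, "[csm]at", "dog")      # only the first match is replaced
str_replace_all(sent, "[csm]at", "dog")  # all three matches are replaced

# backreferences \\1 and \\2 reinsert the captured groups, here swapped
str_replace("cats and dogs", "(\\w+) and (\\w+)", "\\2 and \\1")
```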
[1] "The dog sat on the mat."
[1] "The dog dog on the dog."
[1] "dogs and cats"
[1] "Grammar is the system of a **language**. People **sometimes** **describe** grammar as the rules of a **language**, but i"
str_remove() and str_remove_all()

Shorthand for str_replace(x, pattern, "") and str_replace_all(x, pattern, ""):
[1] "The cat sat on the mat"
[1] "Call us on --"
[1] "linguistics"
[1] "Grammar" "system" "People" "sometimes" "describe" "grammar"
[7] "rules" "fact" "language" "word"
str_split()

Split strings on a pattern, returning a list:
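Calls along these lines reproduce the first two outputs below:

```r
library(stringr)

str_split("the cat sat on the mat", " ")[[1]]
# "the" "cat" "sat" "on" "the" "mat"

str_split("one,two,,three,four", ",+")[[1]]
# "one" "two" "three" "four": the "+" merges adjacent delimiters
```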
[1] "the" "cat" "sat" "on" "the" "mat"
[1] "one" "two" "three" "four"
[1] "Grammar is the system of a language."
[2] "People sometimes describe grammar as the rules of a language, but in fact no language has rules."
[3] "If we use the word rules, we suggest that somebody created the rules first and then spoke the language, like the rules of a game."
str_locate() — Where Is the Match?

Returns the start and end positions of matches — useful when you need to know where in the string a pattern occurs:
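A sketch on the short sentence (the outputs below were produced on the longer example text):

```r
library(stringr)

sent <- "The cat sat on the mat."

str_locate(sent, "cat")          # a matrix: start 5, end 7
str_locate_all(sent, "at")[[1]]  # positions of all three occurrences of "at"
```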
start end
[1,] 64 70
start end
[1,] 64 70
[2,] 442 448
[3,] 538 544
[4,] 728 734
[5,] 786 792
[6,] 847 853
[7,] 1237 1243
stringr Functions
Q1. What is the difference between str_extract() and str_extract_all()?
Q2. You want to capitalise all words longer than 5 characters in a text. Which stringr function would you use?
What you’ll learn: How to apply regular expressions to realistic corpus linguistics and text processing tasks
Tasks covered: Corpus searching, text cleaning, extraction, frequency analysis, and dplyr integration
A common corpus task is retrieving all contexts in which a pattern appears. We simulate a small multi-document corpus:
set.seed(42)
corpus <- data.frame(
doc_id = paste0("doc", 1:10),
register = rep(c("Academic", "News"), each = 5),
text = c(
"Grammar is the systematic study of the structure of a language.",
"Morphology examines how words are formed from smaller units called morphemes.",
"Syntax deals with the arrangement of words to form grammatical sentences.",
"Phonology studies the sound systems and phonological rules of languages.",
"Pragmatics investigates how context and intention affect meaning in communication.",
"Scientists announced a major breakthrough in natural language processing yesterday.",
"The new grammar checker software was released to the public on Monday morning.",
"Researchers found that bilingual speakers process syntax differently than monolinguals.",
"Language acquisition in children follows predictable phonological and syntactic stages.",
"The government launched a literacy program to improve grammar skills in schools."
),
stringsAsFactors = FALSE
)

  doc_id register
1 doc2 Academic
2 doc4 Academic
text
1 Morphology examines how words are formed from smaller units called morphemes.
2 Phonology studies the sound systems and phonological rules of languages.
doc_id ology_words
1 doc2 Morphology
2 doc4 Phonology
doc_id register n_grammar
1 doc1 Academic 1
2 doc7 News 1
3 doc10 News 1
4 doc2 Academic 0
5 doc3 Academic 0
6 doc4 Academic 0
7 doc5 Academic 0
8 doc6 News 0
9 doc8 News 0
10 doc9 News 0
# Count different syntactic subfields mentioned
subfields <- c("syntax", "morphology", "phonology", "pragmatics", "grammar")
subfield_counts <- sapply(subfields, function(sf)
sum(str_count(corpus$text, regex(sf, ignore_case = TRUE))))
data.frame(subfield = subfields, count = subfield_counts) |>
dplyr::arrange(dplyr::desc(count)) |>
flextable() |>
flextable::set_table_properties(width = .4, layout = "autofit") |>
flextable::theme_zebra() |>
flextable::fontsize(size = 12) |>
flextable::fontsize(size = 12, part = "header") |>
flextable::align_text_col(align = "center") |>
flextable::set_caption(caption = "Frequency of linguistic subfield terms in the corpus.") |>
flextable::border_outer()

subfield | count |
|---|---|
grammar | 3 |
syntax | 2 |
morphology | 1 |
phonology | 1 |
pragmatics | 1 |
Regular expressions are the primary tool for cleaning raw corpus text:
raw_texts <- c(
" Grammar is the system of a language. ",
"Words like 'cat', 'bat', and 'hat' rhyme!",
"Phone: +61-7-3365-1234 | Email: info@uq.edu.au",
"Chapter 4: Syntax (pp. 112--145) — see also §3.2",
"The year\t2024\twas notable for advances in NLP."
)
raw_texts |>
# Normalise whitespace (collapse multiple spaces/tabs to one space)
str_replace_all("\\s+", " ") |>
# Remove leading and trailing whitespace
str_trim() |>
# Remove content in parentheses
str_remove_all("\\(.*?\\)") |>
# Remove section references (§3.2 etc.)
str_remove_all("§\\d+\\.\\d+") |>
# Remove em dashes and extra spaces left behind
str_remove_all("—\\s*") |>
# Trim again after removals
str_trim()

[1] "Grammar is the system of a language."
[2] "Words like 'cat', 'bat', and 'hat' rhyme!"
[3] "Phone: +61-7-3365-1234 | Email: info@uq.edu.au"
[4] "Chapter 4: Syntax see also"
[5] "The year 2024 was notable for advances in NLP."
A powerful application of regex is extracting structured information from free text:
# Simulate file names with embedded metadata
file_names <- c(
"speaker01_female_academic_2019.txt",
"speaker14_male_news_2021.txt",
"speaker07_female_fiction_2020.txt",
"speaker23_male_academic_2022.txt"
)
# Extract each metadata component
data.frame(
filename = file_names,
speaker_id = str_extract(file_names, "speaker\\d+"),
gender = str_extract(file_names, "(?<=_)(female|male)(?=_)"),
# use [a-z]+ rather than \\w+ so the match stops before the "_2019" part
register = str_extract(file_names, "(?<=_(?:female|male)_)[a-z]+"),
year = str_extract(file_names, "\\d{4}")
)

                            filename speaker_id gender register year
1 speaker01_female_academic_2019.txt  speaker01 female academic 2019
2       speaker14_male_news_2021.txt  speaker14   male     news 2021
3  speaker07_female_fiction_2020.txt  speaker07 female  fiction 2020
4   speaker23_male_academic_2022.txt  speaker23   male academic 2022
By default, regex in stringr is case-sensitive. Use regex(..., ignore_case = TRUE) to match regardless of case:
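A minimal sketch of the contrast:

```r
library(stringr)

x <- c("Grammar", "grammar", "GRAMMAR", "GrAmMaR")

str_detect(x, "grammar")                             # only the all-lowercase form
str_detect(x, regex("grammar", ignore_case = TRUE))  # all four variants
```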
[1] TRUE TRUE TRUE TRUE
[1] "Grammar" "grammar" "Grammars" "grammar" "Grammar" "grammar"
[7] "grammar" "grammar" "grammar" "grammars" "grammar"
dplyr Pipelines

Regular expressions integrate seamlessly with dplyr for filtering and creating new columns:
corpus |>
# Filter: keep only documents mentioning a specific pattern
dplyr::filter(str_detect(text, regex("syntax|morphology", ignore_case = TRUE))) |>
# Mutate: extract the first linguistic subfield mentioned
dplyr::mutate(
primary_topic = str_extract(text,
regex("syntax|morphology|phonology|pragmatics|grammar",
ignore_case = TRUE)),
n_words = str_count(text, "\\S+"),
has_definition = str_detect(text, "\\bis\\b|\\bdeals with\\b|\\bexamines\\b")
) |>
dplyr::select(doc_id, register, primary_topic, n_words, has_definition)

  doc_id register primary_topic n_words has_definition
1 doc2 Academic Morphology 11 TRUE
2 doc3 Academic Syntax 11 TRUE
3 doc8 News syntax 10 FALSE
Q1. What regular expression would you use to extract all words that contain at least one digit (e.g., “A4”, “mp3”, “COVID-19”)?
Q2. You want to extract the domain name from email addresses (the part after @ and before the final .). Which regex extracts uq from user@uq.edu.au?
Q3. What does str_replace_all(text, "(\\w+) and (\\w+)", "\\2 and \\1") do?
Ten practical exercises covering the most common corpus-search regex tasks
Each question asks you to identify the correct regular expression for a realistic search task on a tokenised text vector. All answers use stringr::str_detect() applied to a character vector called text.
Q1. Which regex extracts all forms of walk from a tokenised text (walk, walks, walked, walking, walker)?
Q2. Which regex extracts all words beginning with “un” (e.g., ungrammatical, unusual, undo)?
Q3. Which regex finds all numeric tokens (whole numbers like 2024, 42, 100)?
Q4. Which regex extracts all words ending in -ing (e.g., running, working, thinking)?
Q5. Which regex matches email addresses (e.g., cat@uq.edu.au, info@ladal.edu.au)?
Q6. Which regex identifies tokens that contain at least one digit mixed with letters (e.g., mp3, A4, COVID-19, type2)?
Q7. Which regex extracts hyphenated compound words (e.g., well-being, self-aware, long-term)?
Q8. Which regex finds capitalised tokens — words beginning with an uppercase letter followed by lowercase letters (e.g., proper nouns like London, Paris, Grammar)?
Q9. Which regex finds tokens that are questions ending with a question mark (e.g., you?, this?)?
Q10. Which regex finds tokens containing double vowels (e.g., agreement, book, see, moon)?
A compact reference for the most commonly used regex elements in R
Pattern | Meaning |
|---|---|
. | Any character except newline |
^ | Start of string / line |
$ | End of string / line |
\\b | Word boundary |
\\B | Non-word boundary |
[abc] | One of: a, b, or c |
[^abc] | Not a, b, or c |
[a-z] | Lowercase letter |
[[:alpha:]] | Any letter |
[[:digit:]] | Any digit |
[[:punct:]] | Any punctuation |
* | 0 or more (greedy) |
+ | 1 or more (greedy) |
? | 0 or 1 (optional) |
{n} | Exactly n times |
{n,} | n or more times |
{n,m} | Between n and m times |
(abc) | Capturing group |
(?:abc) | Non-capturing group |
a|b | a or b |
\\w | Word character [a-zA-Z0-9_] |
\\d | Digit [0-9] |
\\s | Whitespace |
\\W | Non-word character |
\\D | Non-digit |
\\S | Non-whitespace |
(?=...) | Positive lookahead |
(?!...) | Negative lookahead |
(?<=...) | Positive lookbehind |
(?<!...) | Negative lookbehind |
stringr Function Summary

Function | Returns |
|---|---|
str_detect(x, p) | logical vector — does p match? |
str_count(x, p) | integer vector — how many matches? |
str_extract(x, p) | character vector — first match (NA if none) |
str_extract_all(x, p) | list of character vectors — all matches |
str_replace(x, p, r) | character vector — first match replaced |
str_replace_all(x, p, r) | character vector — all matches replaced |
str_remove(x, p) | character vector — first match removed |
str_remove_all(x, p) | character vector — all matches removed |
str_split(x, p) | list of character vectors — parts between matches |
str_locate(x, p) | integer matrix — start and end of first match |
str_locate_all(x, p) | list of integer matrices — all match positions |
str_starts(x, p) | logical — does x start with p? |
str_ends(x, p) | logical — does x end with p? |
Schweinberger, Martin. 2026. Regular Expressions in R. Brisbane: The University of Queensland. url: https://ladal.edu.au/tutorials/regex/regex.html (Version 2026.02.19).
@manual{schweinberger2026regex,
author = {Schweinberger, Martin},
title = {Regular Expressions in R},
note = {https://ladal.edu.au/tutorials/regex/regex.html},
year = {2026},
organization = {The University of Queensland, Australia. School of Languages and Cultures},
address = {Brisbane},
edition = {2026.02.19}
}
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
time zone: Australia/Brisbane
tzcode source: internal
attached base packages:
[1] stats graphics grDevices datasets utils methods base
other attached packages:
[1] checkdown_0.0.13 flextable_0.9.7 lubridate_1.9.4 forcats_1.0.0
[5] stringr_1.5.1 dplyr_1.2.0 purrr_1.0.4 readr_2.1.5
[9] tidyr_1.3.2 tibble_3.2.1 ggplot2_4.0.2 tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] generics_0.1.3 fontLiberation_0.1.0 renv_1.1.1
[4] xml2_1.3.6 stringi_1.8.4 hms_1.1.3
[7] digest_0.6.39 magrittr_2.0.3 evaluate_1.0.3
[10] grid_4.4.2 timechange_0.3.0 RColorBrewer_1.1-3
[13] fastmap_1.2.0 jsonlite_1.9.0 zip_2.3.2
[16] scales_1.4.0 fontBitstreamVera_0.1.1 codetools_0.2-20
[19] textshaping_1.0.0 cli_3.6.4 rlang_1.1.7
[22] fontquiver_0.2.1 litedown_0.9 commonmark_2.0.0
[25] withr_3.0.2 yaml_2.3.10 gdtools_0.4.1
[28] tools_4.4.2 officer_0.6.7 uuid_1.2-1
[31] tzdb_0.4.0 vctrs_0.7.1 R6_2.6.1
[34] lifecycle_1.0.5 htmlwidgets_1.6.4 ragg_1.3.3
[37] pkgconfig_2.0.3 pillar_1.10.1 gtable_0.3.6
[40] glue_1.8.0 data.table_1.17.0 Rcpp_1.0.14
[43] systemfonts_1.2.1 xfun_0.56 tidyselect_1.2.1
[46] rstudioapi_0.17.1 knitr_1.51 farver_2.1.2
[49] htmltools_0.5.9 rmarkdown_2.30 compiler_4.4.2
[52] S7_0.2.1 markdown_2.0 askpass_1.2.1
[55] openssl_2.3.2